N-Grams

An N-Gram is an ordered sequence of words. For example:

In the following series of quizes, you will work with 2-grams, or bigrams, as they are more commonly called.
The objective is to create a function that calculates the probability that a particular sentence
could occur in a corpus of text, based on the probabilities of its component bigrams. We'll do this in stages though:

Quiz 1 - Extract tokens and bigrams from a sentence
Quiz 2 - Calculate probabilities for bigrams
Quiz 3 - Calculate the log probability of a given sentence based on a corpus of text using bigrams

Assumptions and terminology

We will assume that text data is in the form of sentences with no punctuation. If a sentence is in a single line, we will add add a token for
start of sentence: <s> and end of sentence: </s>. For example, if the sentence is "I love language models." it will appear in code as:

'I love language models'

The tokens for this sentence are represented as an ordered list of the lower case words plus the start and end sentence tags:

tokens = ['<s>', 'i', 'love', 'language', 'models', '</s>']

The bigrams for this sentence are represented as a list of lower case ordered pairs of tokens:

bigrams = [('<s>', 'i'), ('i', 'love'), ('love', 'language'), ('language', 'models'), ('models', '</s>')]

Quiz 1 Instructions

In the quiz below, write a function that returns a list of tokens and a list of bigrams for a given sentence. You will need to first break a sentence into words in a list, then add a <s> and <s/> token to the
start and end of the list to represent the start and end of the sentence.

Your final lists should be in the format shown above and called out in the function doc string.

Start Quiz:

function.py solution.py

test_sentences = [
    'the old man spoke to me',
    'me to spoke man old the',
    'old man me old man me',
]

def sentence_to_bigrams(sentence):
    """
    Add start '<s>' and stop '</s>' tags to the sentence and tokenize it into a list
    of lower-case words (sentence_tokens) and bigrams (sentence_bigrams)
    :param sentence: string
    :return: list, list
        sentence_tokens: ordered list of words found in the sentence
        sentence_bigrams: a list of ordered two-word tuples found in the sentence
    """
    #TODO implement
    sentence_tokens = None
    sentence_bigrams = None
    return sentence_tokens, sentence_bigrams

test_sentences = [
    'the old man spoke to me',
    'me to spoke man old the',
    'old man me old man me',
]

def sentence_to_bigrams(sentence):
    """
    Add start '<s>' and stop '</s>' tags to the sentence and tokenize it into a list
    of lower-case words (sentence_tokens) and bigrams (sentence_bigrams)
    :param sentence: string
    :return: list, list
        sentence_tokens: ordered list of words found in the sentence
        sentence_bigrams: a list of ordered two-word tuples found in the sentence
    """
    sentence_tokens = ['<s>'] + sentence.lower().split() + ['</s>']
    sentence_bigrams = []
    for i in range(len(sentence_tokens)-1):
        sentence_bigrams.append((sentence_tokens[i], sentence_tokens[i+1]))
    return sentence_tokens, sentence_bigrams

Probabilities and Likelihoods with Bigrams

Recall from a previous video that the probability of a series of words
can be calculated from the chained probabilities of its history:

The probabilities of sequence occurrences in a large textual corpus can be calculated this
way and used as a language model to add grammar and contectual knowledge to a speech
recognition system. However, there is a prohibitively large number of calculations for all the
possible sequences of varying length in a large textual corpus.

To address this problem, we use the Markov Assumption to approximate
a sequence probability with a shorter sequence:

In the bigram case, the equation reduces to a series of bigram probabilities multiplied together to find the approximate probability for a sentence. A concrete example:

\qquad \qquad P("I\: Iove\: language\: models") \approx

\qquad \qquad \qquad \qquad P("love"|"I")P("language"|"love")P("models"|"language")

We can calculate the probabilities by using counts of the bigramsand individual tokens. The counts are represented below with the c() operator:

In Python, the Counter method is useful for this task

from collections import Counter
# Sentence: "I am as I am"
tokens = ['<s>', 'i', 'am', 'as', 'i', 'am', '</s>']
token_counts = Counter(tokens)
print(token_counts)
# output:
# Counter({'</s>': 1, '<s>': 1, 'am': 2, 'as': 1, 'i': 2})

Quiz 2 Instructions

In the quiz below, write a function that returns a probability dictionary when given a lists of tokens and bigrams.

Start Quiz:

function.py utils.py solution.py transcripts.txt

from collections import Counter
import utils


def sample_run():
    # sample usage by test code (this definition not actually run for the quiz)
    tokens, bigrams = utils.bigrams_from_transcript('transcripts.txt')
    bg_dict = bigram_mle(tokens, bigrams)
    print(bg_dict)


def bigram_mle(tokens, bigrams):
    """
    provide a dictionary of probabilities for all bigrams in a corpus of text
    the calculation is based on maximum likelihood estimation and does not include
    any smoothing.  A tag '<unk>' has been added for unknown probabilities.
    :param tokens: list
        tokens: list of all tokens in the corpus
    :param bigrams: list
        bigrams: list of all two word tuples in the corpus
    :return: dict
        bg_mle_dict: a dictionary of bigrams:
            key: tuple of two bigram words, in order OR <unk> key
            value: float probability

    """
    bg_mle_dict = {}
    bg_mle_dict['<unk>'] = 0.
    #TODO implement
    return bg_mle_dict

def bigrams_from_transcript(filename):
    """
    read a file of sentences, adding start '<s>' and stop '</s>' tags; Tokenize it into a list of lower case words
    and bigrams
    :param filename: string 
        filename: path to a text file consisting of lines of non-puncuated text; assume one sentence per line
    :return: list, list
        tokens: ordered list of words found in the file
        bigrams: a list of ordered two-word tuples found in the file
    """
    tokens = []
    bigrams = []
    with open(filename, 'r') as f:
        for line in f:
            line_tokens, line_bigrams = sentence_to_bigrams(line)
            tokens = tokens + line_tokens
            bigrams = bigrams + line_bigrams
    return tokens, bigrams


def sentence_to_bigrams(sentence):
    """
    Add start '<s>' and stop '</s>' tags to the sentence and tokenize it into a list
    of lower-case words (sentence_tokens) and bigrams (sentence_bigrams)
    :param sentence: string
    :return: list, list
        sentence_tokens: ordered list of words found in the sentence
        sentence_bigrams: a list of ordered two-word tuples found in the sentence
    """
    sentence_tokens = ['<s>'] + sentence.lower().split() + ['</s>']
    sentence_bigrams = []
    for i in range(len(sentence_tokens)-1):
        sentence_bigrams.append((sentence_tokens[i], sentence_tokens[i+1]))
    return sentence_tokens, sentence_bigrams

from collections import Counter
import utils

def bigram_mle(tokens, bigrams):
    """
    provide a dictionary of probabilities for all bigrams in a corpus of text
    the calculation is based on maximum likelihood estimation and does not include
    any smoothing.  A tag '<unk>' has been added for unknown probabilities.
    :param tokens: list
        tokens: list of all tokens in the corpus
    :param bigrams: list
        bigrams: list of all two word tuples in the corpus
    :return: dict
        bg_mle_dict: a dictionary of bigrams:
            key: tuple of two bigram words, in order OR <unk> key
            value: float probability
            
    """
    bg_mle_dict = {}
    bg_mle_dict['<unk>'] = 0.

    token_raw_counts = Counter(tokens)
    bigram_raw_counts = Counter(bigrams)
    for bg in bigram_raw_counts:
        bg_mle_dict[bg] = bigram_raw_counts[bg] / token_raw_counts[bg[0]]
    return bg_mle_dict

GO DO YOU HEAR
BUT IN LESS THAN FIVE MINUTES THE STAIRCASE GROANED BENEATH AN EXTRAORDINARY WEIGHT
AT THIS MOMENT THE WHOLE SOUL OF THE OLD MAN SEEMED CENTRED IN HIS EYES WHICH BECAME BLOODSHOT THE VEINS OF THE THROAT SWELLED HIS CHEEKS AND TEMPLES BECAME PURPLE AS THOUGH HE WAS STRUCK WITH EPILEPSY NOTHING WAS WANTING TO COMPLETE THIS BUT THE UTTERANCE OF A CRY
AND THE CRY ISSUED FROM HIS PORES IF WE MAY THUS SPEAK A CRY FRIGHTFUL IN ITS SILENCE
DAVRIGNY RUSHED TOWARDS THE OLD MAN AND MADE HIM INHALE A POWERFUL RESTORATIVE
DAVRIGNY UNABLE TO BEAR THE SIGHT OF THIS TOUCHING EMOTION TURNED AWAY AND VILLEFORT WITHOUT SEEKING ANY FURTHER EXPLANATION AND ATTRACTED TOWARDS HIM BY THE IRRESISTIBLE MAGNETISM WHICH DRAWS US TOWARDS THOSE WHO HAVE LOVED THE PEOPLE FOR WHOM WE MOURN EXTENDED HIS HAND TOWARDS THE YOUNG MAN
FOR SOME TIME NOTHING WAS HEARD IN THAT CHAMBER BUT SOBS EXCLAMATIONS AND PRAYERS
WHAT DO YOU MEAN SIR
OH YOU RAVE SIR EXCLAIMED VILLEFORT IN VAIN ENDEAVORING TO ESCAPE THE NET IN WHICH HE WAS TAKEN I RAVE
DO YOU KNOW THE ASSASSIN ASKED MORREL
NOIRTIER LOOKED UPON MORREL WITH ONE OF THOSE MELANCHOLY SMILES WHICH HAD SO OFTEN MADE VALENTINE HAPPY AND THUS FIXED HIS ATTENTION
SAID MORREL SADLY YES REPLIED NOIRTIER
THE OLD MANS EYES REMAINED FIXED ON THE DOOR
ASKED MORREL YES
MUST I LEAVE ALONE NO
BUT CAN HE UNDERSTAND YOU YES
GENTLEMEN HE SAID IN A HOARSE VOICE GIVE ME YOUR WORD OF HONOR THAT THIS HORRIBLE SECRET SHALL FOREVER REMAIN BURIED AMONGST OURSELVES THE TWO MEN DREW BACK
MY FATHER HAS REVEALED THE CULPRITS NAME MY FATHER THIRSTS FOR REVENGE AS MUCH AS YOU DO YET EVEN HE CONJURES YOU AS I DO TO KEEP THIS SECRET DO YOU NOT FATHER
MORREL SUFFERED AN EXCLAMATION OF HORROR AND SURPRISE TO ESCAPE HIM
THE OLD MAN MADE A SIGN IN THE AFFIRMATIVE
IT WAS SOMETHING TERRIBLE TO WITNESS THE SILENT AGONY THE MUTE DESPAIR OF NOIRTIER WHOSE TEARS SILENTLY ROLLED DOWN HIS CHEEKS
BUT HE STOPPED ON THE LANDING HE HAD NOT THE COURAGE TO AGAIN VISIT THE DEATH CHAMBER
THE TWO DOCTORS THEREFORE ENTERED THE ROOM ALONE
NOIRTIER WAS NEAR THE BED PALE MOTIONLESS AND SILENT AS THE CORPSE
THE DISTRICT DOCTOR APPROACHED WITH THE INDIFFERENCE OF A MAN ACCUSTOMED TO SPEND HALF HIS TIME AMONGST THE DEAD HE THEN LIFTED THE SHEET WHICH WAS PLACED OVER THE FACE AND JUST UNCLOSED THE LIPS
THE NEAREST SAID THE DISTRICT DOCTOR IS A GOOD ITALIAN ABBE WHO LIVES NEXT DOOR TO YOU SHALL I CALL ON HIM AS I PASS
DAVRIGNY SAID VILLEFORT BE SO KIND I BESEECH YOU AS TO ACCOMPANY THIS GENTLEMAN HERE IS THE KEY OF THE DOOR SO THAT YOU CAN GO IN AND OUT AS YOU PLEASE YOU WILL BRING THE PRIEST WITH YOU AND WILL OBLIGE ME BY INTRODUCING HIM INTO MY CHILDS ROOM DO YOU WISH TO SEE HIM
I ONLY WISH TO BE ALONE YOU WILL EXCUSE ME WILL YOU NOT
I AM GOING SIR AND I DO NOT HESITATE TO SAY THAT NO PRAYERS WILL BE MORE FERVENT THAN MINE

Smoothing and logs

There are still a couple of problems to sort out before we use the bigram probability dictionary to calculate the probabilities of new sentences:

1. Some possible combinations may not exist in our probability dictionary but are still possible. We don't want to multiply in a probability of 0 just because our original corpus was deficient. This is solved through "smoothing". There are a number of methods for this, but a simple one is the Laplace smoothing with the "add-one" estimate where V is the size of the vocabulary for the corpus, i.e. the number of unique tokens:

2. Repeated multiplications of small probabilities can cause underflow problems in computers when

the values become to small. To solve this, we will calculate all probabilities in log space:

\qquad \qquad \qquad log(p_1\times p_2\times p_3\times p_4) = \log p_1 + \log p_2 + \log p_3 + \log p_4

Quiz 3 Instructions

In the following quiz, a utility named utils.bigram_add1_logs has been added for you with Laplace smoothing in the log space. Write a function that calculates the log probability for a given sentence, using this log probability dictionary. If all goes well, you should observe that more likely sentences yield higher values for the log probabilities.

Start Quiz:

function.py utils.py transcripts.txt solution.py

import utils

test_sentences = [
    'the old man spoke to me',
    'me to spoke man old the',
    'old man me old man me',
]

def sample_run():
    # sample usage by test code (this definition not actually run for the quiz)
    bigram_log_dict = utils.bigram_add1_logs('transcripts.txt')
    for sentence in test_sentences:
        print('*** "{}"'.format(sentence))
        print(log_prob_of_sentence(sentence, bigram_log_dict))

def log_prob_of_sentence(sentence, bigram_log_dict):
    total_log_prob = 0.

    # TODO implement
    # get the sentence bigrams with utils.sentence_to_bigrams
    # look up the bigrams from the sentence in the bigram_log_dict
    # add all the the log probabilities together
    # if a word doesn't exist, be sure to use the value of the '<unk>' lookup instead

    return total_log_prob

from collections import Counter
import numpy as np

def bigrams_from_transcript(filename):
    """
    read a file of sentences, adding start '<s>' and stop '</s>' tags; Tokenize it into a list of lower case words
    and bigrams
    :param filename: string 
        filename: path to a text file consisting of lines of non-puncuated text; assume one sentence per line
    :return: list, list
        tokens: ordered list of words found in the file
        bigrams: a list of ordered two-word tuples found in the file
    """
    tokens = []
    bigrams = []
    with open(filename, 'r') as f:
        for line in f:
            line_tokens, line_bigrams = sentence_to_bigrams(line)
            tokens = tokens + line_tokens
            bigrams = bigrams + line_bigrams
    return tokens, bigrams


def sentence_to_bigrams(sentence):
    """
    Add start '<s>' and stop '</s>' tags to the sentence and tokenize it into a list
    of lower-case words (sentence_tokens) and bigrams (sentence_bigrams)
    :param sentence: string
    :return: list, list
        sentence_tokens: ordered list of words found in the sentence
        sentence_bigrams: a list of ordered two-word tuples found in the sentence
    """
    sentence_tokens = ['<s>'] + sentence.lower().split() + ['</s>']
    sentence_bigrams = []
    for i in range(len(sentence_tokens)-1):
        sentence_bigrams.append((sentence_tokens[i], sentence_tokens[i+1]))
    return sentence_tokens, sentence_bigrams

def bigram_add1_logs(transcript_file):
    """
    provide a smoothed log probability dictionary based on a transcript
    :param transcript_file: string
        transcript_file is the path filename containing unpunctuated text sentences
    :return: dict
        bg_add1_log_dict: dictionary of smoothed bigrams log probabilities including
        tags: <s>: start of sentence, </s>: end of sentence, <unk>: unknown placeholder probability
    """

    tokens, bigrams = bigrams_from_transcript(transcript_file)
    token_counts = Counter(tokens)
    bigram_counts = Counter(bigrams)
    vocab_count = len(token_counts)

    bg_addone_dict = {}
    for bg in bigram_counts:
        bg_addone_dict[bg] = np.log((bigram_counts[bg] + 1.) / (token_counts[bg[0]] + vocab_count))
    bg_addone_dict['<unk>'] = np.log(1. / vocab_count)
    return bg_addone_dict

GO DO YOU HEAR
BUT IN LESS THAN FIVE MINUTES THE STAIRCASE GROANED BENEATH AN EXTRAORDINARY WEIGHT
AT THIS MOMENT THE WHOLE SOUL OF THE OLD MAN SEEMED CENTRED IN HIS EYES WHICH BECAME BLOODSHOT THE VEINS OF THE THROAT SWELLED HIS CHEEKS AND TEMPLES BECAME PURPLE AS THOUGH HE WAS STRUCK WITH EPILEPSY NOTHING WAS WANTING TO COMPLETE THIS BUT THE UTTERANCE OF A CRY
AND THE CRY ISSUED FROM HIS PORES IF WE MAY THUS SPEAK A CRY FRIGHTFUL IN ITS SILENCE
DAVRIGNY RUSHED TOWARDS THE OLD MAN AND MADE HIM INHALE A POWERFUL RESTORATIVE
DAVRIGNY UNABLE TO BEAR THE SIGHT OF THIS TOUCHING EMOTION TURNED AWAY AND VILLEFORT WITHOUT SEEKING ANY FURTHER EXPLANATION AND ATTRACTED TOWARDS HIM BY THE IRRESISTIBLE MAGNETISM WHICH DRAWS US TOWARDS THOSE WHO HAVE LOVED THE PEOPLE FOR WHOM WE MOURN EXTENDED HIS HAND TOWARDS THE YOUNG MAN
FOR SOME TIME NOTHING WAS HEARD IN THAT CHAMBER BUT SOBS EXCLAMATIONS AND PRAYERS
WHAT DO YOU MEAN SIR
OH YOU RAVE SIR EXCLAIMED VILLEFORT IN VAIN ENDEAVORING TO ESCAPE THE NET IN WHICH HE WAS TAKEN I RAVE
DO YOU KNOW THE ASSASSIN ASKED MORREL
NOIRTIER LOOKED UPON MORREL WITH ONE OF THOSE MELANCHOLY SMILES WHICH HAD SO OFTEN MADE VALENTINE HAPPY AND THUS FIXED HIS ATTENTION
SAID MORREL SADLY YES REPLIED NOIRTIER
THE OLD MANS EYES REMAINED FIXED ON THE DOOR
ASKED MORREL YES
MUST I LEAVE ALONE NO
BUT CAN HE UNDERSTAND YOU YES
GENTLEMEN HE SAID IN A HOARSE VOICE GIVE ME YOUR WORD OF HONOR THAT THIS HORRIBLE SECRET SHALL FOREVER REMAIN BURIED AMONGST OURSELVES THE TWO MEN DREW BACK
MY FATHER HAS REVEALED THE CULPRITS NAME MY FATHER THIRSTS FOR REVENGE AS MUCH AS YOU DO YET EVEN HE CONJURES YOU AS I DO TO KEEP THIS SECRET DO YOU NOT FATHER
MORREL SUFFERED AN EXCLAMATION OF HORROR AND SURPRISE TO ESCAPE HIM
THE OLD MAN MADE A SIGN IN THE AFFIRMATIVE
IT WAS SOMETHING TERRIBLE TO WITNESS THE SILENT AGONY THE MUTE DESPAIR OF NOIRTIER WHOSE TEARS SILENTLY ROLLED DOWN HIS CHEEKS
BUT HE STOPPED ON THE LANDING HE HAD NOT THE COURAGE TO AGAIN VISIT THE DEATH CHAMBER
THE TWO DOCTORS THEREFORE ENTERED THE ROOM ALONE
NOIRTIER WAS NEAR THE BED PALE MOTIONLESS AND SILENT AS THE CORPSE
THE DISTRICT DOCTOR APPROACHED WITH THE INDIFFERENCE OF A MAN ACCUSTOMED TO SPEND HALF HIS TIME AMONGST THE DEAD HE THEN LIFTED THE SHEET WHICH WAS PLACED OVER THE FACE AND JUST UNCLOSED THE LIPS
THE NEAREST SAID THE DISTRICT DOCTOR IS A GOOD ITALIAN ABBE WHO LIVES NEXT DOOR TO YOU SHALL I CALL ON HIM AS I PASS
DAVRIGNY SAID VILLEFORT BE SO KIND I BESEECH YOU AS TO ACCOMPANY THIS GENTLEMAN HERE IS THE KEY OF THE DOOR SO THAT YOU CAN GO IN AND OUT AS YOU PLEASE YOU WILL BRING THE PRIEST WITH YOU AND WILL OBLIGE ME BY INTRODUCING HIM INTO MY CHILDS ROOM DO YOU WISH TO SEE HIM
I ONLY WISH TO BE ALONE YOU WILL EXCUSE ME WILL YOU NOT
I AM GOING SIR AND I DO NOT HESITATE TO SAY THAT NO PRAYERS WILL BE MORE FERVENT THAN MINE

import ngram_quiz_3.utils as utils

test_sentences = [
    'the old man spoke to me',
    'me to spoke man old the',
    'old man me old man me',
]


def log_prob_of_sentence(sentence, bigram_log_dict):
    # get the sentence bigrams
    s_tokens, s_bigrams = utils.sentence_to_bigrams(sentence)

    # add the log probabilites of the bigrams in the sentence
    total_log_prob = 0.
    for bg in s_bigrams:
        if bg in bigram_log_dict:
            total_log_prob = total_log_prob + bigram_log_dict[bg]
        else:
            total_log_prob = total_log_prob + bigram_log_dict['<unk>']
    return total_log_prob

Next Concept